NYC Education

Introduction

“Achieving inclusive and quality education for all reaffirms the belief that education is one of the most powerful and proven vehicles for sustainable development.” - UN

It should be the goal of any governing organization to ensure high-quality education for all, as its benefits are large and wide-ranging. Illiteracy means you have a substantially higher likelihood of ending up in jail or on welfare, and it has a negative impact on both discrimination and preventable diseases. For every dollar spent on combating adult illiteracy, the ROI (return on investment) is $6.14 (614%). Another extremely important effect of education is the social network you get, which combats loneliness, which in itself has a number of negative health impacts [1].
Given that there is no doubt about the importance of education, it’s important to investigate when the educational system fails and people drop out, and which factors have an impact on the dropout. To investigate this we’ll look at poverty data from New York City from 2015. We focus on education since it is one of the 17 Sustainable Development Goals (SDGs) and is important for developing our world in a sustainable direction.

The Data

The data can be obtained from data.cityofnewyork.
It contains 69103 observations and 61 columns of which we only use a subset of 11 columns:

  • Age of person (AGEP)

  • Borough (Bronx, Brooklyn, Manhattan, Queens, Staten Island) (Boro)

  • Disability (DIS)

  • Level of education (no high school, high school, some college, bachelor degree or above) (EducAttain)

  • Ethnicity (white, black, Asian, Hispanic, other)

  • Language other than English spoken at home (yes/no) (LANX)

  • Sex (SEX)

  • Total income

    • Interest, dividends, and net rental income (INTP_adj)

    • Self-employment income (SEMP_adj)

    • Wages or salary income (WAGP_adj)

    • Retirement income (RETP_adj)

We mainly use people’s educational status, since this is what we are interested in investigating. Additionally, we use the other features listed above to see which of them have an influence on education.

Furthermore, we only look at adults (people older than 24), as younger people have not had a fair opportunity to finish a bachelor’s degree. Hence we remove all rows with younger people, which leaves us with a little less than 50000 participants.
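The age filter described above can be sketched in a few lines. This is only an illustration, not the code used for the report: the rows are made up, and we assume that "older than 24" means strictly above 24. AGEP and EducAttain are the column names from the dataset.

```python
# Made-up example rows; only the column names AGEP and EducAttain
# come from the dataset description above.
rows = [
    {"AGEP": 19, "EducAttain": "High school"},
    {"AGEP": 24, "EducAttain": "Some college"},
    {"AGEP": 25, "EducAttain": "Bachelor degree or above"},
    {"AGEP": 61, "EducAttain": "No high school"},
]

MIN_AGE = 24  # assumption: people aged 24 or younger are removed


def keep_adults(records, min_age=MIN_AGE):
    """Return only the records whose age is strictly above min_age."""
    return [r for r in records if r["AGEP"] > min_age]


adults = keep_adults(rows)
print(len(adults))  # 2
```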

The data is generated annually by a research unit in the Mayor’s office. It is derived from the American Community Survey Public Use Microsample for NYC.

An investigation of the impact of education on salary

An obvious attribute that we would expect education to have an impact on is income. It is intuitive that more education leads to better and better-paid jobs, so let’s investigate this claim. We’ll do this by looking at the distribution of total income across education levels:

_images/Report1_8_0.png

Interestingly, the individual with the second-highest income (see the max line in green) has no high school diploma, i.e. this person earns more than everyone with a high school degree or some college. This only means that a high salary/income can be obtained without having any education, not that you can expect a lower maximum income if you have some college education. In fact, what we see is that you can expect a higher salary the higher your education level. This can be seen in both the average (red line) and the median (purple line). This suggests that education is an effective tool against poverty. However, it’s important to note that we know nothing of the jobs that people occupy, so a higher salary does not necessarily mean a job that is a “vehicle for sustainable development” [UN].
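To make the distinction between the maximum, the average and the median concrete, here is a minimal sketch on made-up incomes (not the NYC data). The function name income_summary and all numbers are illustrative assumptions. It shows how a single high earner can lift the maximum of a group without changing the ordering of the medians.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (education level, total income) pairs, for illustration only.
records = [
    ("No high school", 0), ("No high school", 12000), ("No high school", 15000),
    ("No high school", 20000), ("No high school", 150000),
    ("High school", 25000), ("High school", 32000), ("High school", 38000),
    ("High school", 45000), ("High school", 70000),
    ("Bachelor degree or above", 40000), ("Bachelor degree or above", 60000),
    ("Bachelor degree or above", 75000), ("Bachelor degree or above", 90000),
    ("Bachelor degree or above", 200000),
]


def income_summary(pairs):
    """Group incomes by education level and compute mean, median and max."""
    by_level = defaultdict(list)
    for level, income in pairs:
        by_level[level].append(income)
    return {
        level: {"mean": mean(values), "median": median(values), "max": max(values)}
        for level, values in by_level.items()
    }


summary = income_summary(records)
for level, stats in summary.items():
    print(level, stats)
```

In this toy data the group without a high school diploma has a higher maximum than the high school group, while both its mean and its median are the lowest of the three.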

Sex: a hopeful story

SDG 4 says: “Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all” [2]. Unsurprisingly, this also includes women. Worldwide, we know there is a discrepancy between males and females from Hans Rosling’s quiz in the opening of his famous book Factfulness: “Worldwide, 30-year-old men have spent 10 years in school, on average. How many years have women of the same age spent in school?” The answer is 9 years [3]. This number is of course not specific to the US or NYC, so let’s take a look at the distribution of sexes across the different levels of education attained in NYC:

_images/Report1_11_0.png

It’s fairly clear that there is no notable difference in the education obtained by the two sexes. So NYC is doing pretty well regarding gender equality in the educational system.

Salary and sex: a sad story

Although the equality of education between the sexes is a good sign, it’s an entirely different and alarming story when looking at sex and salary:

_images/Report1_14_0.png

Here we have a fairly big discrepancy, as both the average and the median are significantly higher for males (about 70%). This is very alarming, as it is at odds with our previous conclusion that higher education means a higher salary and is thus an effective tool against poverty. The two figures above suggest that although females have as much education as males, they still have a lower average salary, and thus a higher likelihood of being in poverty. Now you might think that this low salary could be explained by stay-at-home moms, but there is a difference in salary between men and women even when removing people who do not earn any money:

_images/Report1_16_0.png

The difference is still significant (about 35%). And even if the entire difference could be explained by stay-at-home moms, there is still a question of whether this should be the case, as it is gender inequality no matter if it is voluntary or not.

Ethnicity, age and education

NYC is a multicultural city with people of all ethnicities. Do all of them get the same opportunities for education?

_images/Report1_19_0.png

We see that among white, Asian, and other residents, over 50% have a higher education, whereas the majority of Hispanic residents in our dataset have less than a high school education, which falls short of SDG 4.1. But could this be a historical issue that is no longer the case? Looking at education in general, we would expect it to be higher the lower the age (for people older than 24), since the focus on education and the resources helping people to get an education have changed dramatically. Additionally, American society is generally less segregated, especially compared to, say, the 1960s, so we would expect to see a greater increase in education for all other races.

_images/Report1_21_0.png

Hispanic residents seem to have the most trouble obtaining an education, even in the younger generation of 20-30 year olds, where there are still some age groups in which a high school education is the most frequent level. This may be troublesome, as the importance of a college degree is steadily increasing [5]. But in line with our theory, the younger generation has a higher level of education, which is a result of the increased focus on education.

Borough

New York City is divided into five boroughs, each with its own flavor [11]:

  • Bronx: one of the most prominent centres of urban poverty in the United States

  • Brooklyn: a collision of old and new

  • Manhattan: the center and public face of NYC, with Central Park, Broadway shows and Times Square

  • Queens: primarily middle-class families, and the most ethnically varied of all the boroughs

  • Staten Island: the most rural part of the city

We wish to see how the borough influences education, and whether this is reflected in people’s salaries. Furthermore, we take a look at the ethnicities in the boroughs to see whether these are related and can help explain the educational situation. To investigate this we have created the following heatmap.


Creating a predictive model that should be a bad predictor (but isn’t)

Imagine you’re a school principal and would like to find out which aspiring new kids will attain some level of college education later in life. You do this so you don’t have to waste any time and resources on people you believe are ultimately undeserving. To do this you gather all the information you can about your new students, such as Borough (location), Disability, languages other than English spoken, Ethnicity and Sex.

Ultimately, this is a list of attributes that really should not have any influence on the level of education an individual will obtain, which is why the ML model used to predict this will hopefully be bad.
Interestingly, we would expect speaking a language other than English to have a negative impact on the level of education attained: the majority of Americans only speak English, so speaking another language is an indication that the person is of another ethnicity. Of course, attributes like Ethnicity and Sex having an influence on education would go directly against SDGs 4 and 5.


Finally, looking at the heat maps, we see that some ethnicities are concentrated in particular boroughs, so Borough having an impact on education is another way of saying that Ethnicity has an impact. Additionally, we know that a community (Borough) has a positive feedback loop in either direction, so higher education leads to even higher education and vice versa (see the heat maps).

Classification model

Predicting whether an individual will attain some college education or not is a classification task, and our classification model is built from decision trees.

A decision tree works by asking “yes/no” questions, for instance: “Is this person male?”. Each question creates a split in the tree. Based on the answer, a new question is asked on each branch, creating new splits, so that each split gives us more information about our data. We select the questions such that the two subgroups resulting from a split are as different from each other as possible, and the data points within each subgroup are as similar as possible. To quantify this we can use a measure called entropy. In the end, the data has been split into multiple subgroups, and within each subgroup the probability of predicting the right target class is higher than if we had not asked any questions.
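As a small illustration of how entropy can be used to compare questions, the sketch below computes the information gain (the reduction in entropy) of two yes/no questions on made-up labels. The data and the names question_a and question_b are hypothetical. A question whose answers separate the classes well gets a high gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, answers):
    """Entropy reduction obtained by splitting labels on a yes/no answer per row."""
    yes = [y for y, a in zip(labels, answers) if a]
    no = [y for y, a in zip(labels, answers) if not a]
    weighted = sum(
        len(part) / len(labels) * entropy(part) for part in (yes, no) if part
    )
    return entropy(labels) - weighted


# Made-up toy data: 1 = some college or more, 0 = no college.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
question_a = [True, True, True, False, False, False, True, False]  # separates the classes perfectly
question_b = [True, False, True, False, True, False, True, False]  # only weakly informative

gain_a = information_gain(labels, question_a)
gain_b = information_gain(labels, question_b)
print(gain_a, gain_b)
```

Here the tree would prefer question_a, since it leaves two subgroups that each contain only one class.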

We then create many of these decision trees, all different and ideally uncorrelated. To make a prediction, we combine the trees and predict what most of them predict. This is called a random forest. The random forest reduces classification errors because it draws on multiple decision trees, so one wrong tree will not make a difference as long as most trees predict correctly.
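The voting step can be sketched as follows. The three "trees" here are hand-written rules invented for illustration, not trees learned from the data, and the sketch leaves out how a real random forest trains its trees on random subsets of rows and features.

```python
from collections import Counter


# Three hypothetical "trees": each is a simple rule returning
# 1 (some college or more) or 0 (no college).
def tree_a(person):
    return 1 if person["Boro"] == "Manhattan" else 0


def tree_b(person):
    return 1 if person["LANX"] == "no" else 0


def tree_c(person):
    return 1 if person["DIS"] == "no" else 0


def forest_predict(trees, person):
    """Return the class predicted by the majority of the trees."""
    votes = Counter(tree(person) for tree in trees)
    return votes.most_common(1)[0][0]


person = {"Boro": "Queens", "LANX": "no", "DIS": "no"}
prediction = forest_predict([tree_a, tree_b, tree_c], person)
print(prediction)  # votes are 0, 1, 1, so the forest predicts 1
```

Using an odd number of trees avoids ties in a two-class problem.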

Finally, we also perform a randomized search to select the best random forest model. The randomized search creates multiple random forest models with different parameters, such as the number of trees in each forest and the maximum number of levels in each tree. It then trains and evaluates all of these models and selects the random forest that performs best.
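The idea of a randomized search can be sketched without any machine learning library. In the sketch below, the parameter values and the scoring function toy_score are stand-ins chosen for illustration; in practice the score would come from training and validating a random forest, and libraries such as scikit-learn offer this as RandomizedSearchCV.

```python
import random

# Hypothetical search space; the actual values used in the report are not stated.
param_space = {
    "n_trees": [50, 100, 200, 400],
    "max_depth": [4, 8, 16, None],
}


def sample_params(space, rng):
    """Draw one random value for every parameter."""
    return {name: rng.choice(values) for name, values in space.items()}


def randomized_search(space, score_fn, n_iter, seed=0):
    """Try n_iter random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(space, rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for a validation score; it peaks at 200 trees and depth 8."""
    depth = params["max_depth"] or 32  # treat "no limit" as a large depth
    return -abs(params["n_trees"] - 200) / 1000 - abs(depth - 8) / 100


best_params, best_score = randomized_search(param_space, toy_score, n_iter=20)
print(best_params, best_score)
```

Unlike an exhaustive grid search, only a random sample of the combinations is evaluated, so the result is the best of the sampled models, not necessarily the best possible one.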

Random forest model results

The binary classification model we have created “unfortunately” performs rather well. This indicates that external factors such as race, disability, and location have an impact on achieving higher education in New York City. It means that, based on a person’s rather neutral features, we can predict whether or not this person has an education (or will get one). These are features that we believe should not have an influence on whether or not a person receives higher education. We can compare our model to a baseline model that predicts everyone to have higher education, whereas our model makes a prediction for each person based on the previously mentioned features.

We get the following statistical performance measurements for the random forest model and the baseline model respectively:

Random Forest Classifier

Performance Measurement    Performance
Accuracy                   0.65
Precision                  0.67
Recall                     0.79
F1 score                   0.73

Baseline model

Performance Measurement    Performance
Accuracy                   0.59
Precision                  0.59
Recall                     1.00
F1 score                   0.74


These measures all range from zero to one, where zero is the worst and one is the best. We see that the random forest model has the better accuracy and precision, whereas the baseline model has the better recall and a marginally better F1 score. Accuracy is the number of correct predictions divided by the total number of predictions. Hence our random forest model predicts correctly more often overall than the baseline model. Precision and recall are two measures often seen together. One way to explain them is that precision is a measure of quality and recall is a measure of quantity [precision, recall wiki]: precision tells you how many of the higher-education predictions are correct, and recall tells you how many of the actual higher-education instances you predicted correctly (the correctly predicted ones are known as true positives, TP). So naturally, when predicting every instance to be higher education, the recall will be 1.

Finally, the F1 score is the harmonic mean of precision and recall.
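The four measures can be written out from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). As a sanity check, the sketch below reproduces the baseline numbers under the assumption of 100 hypothetical people, 59 of whom have higher education (matching the baseline accuracy of 0.59), all predicted to have higher education.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Baseline: everyone is predicted to have higher education, so there are
# no negative predictions at all (FN = TN = 0). The counts are hypothetical.
baseline = classification_metrics(tp=59, fp=41, fn=0, tn=0)
print({name: round(value, 2) for name, value in baseline.items()})
# {'accuracy': 0.59, 'precision': 0.59, 'recall': 1.0, 'f1': 0.74}
```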
Hence, to evaluate the model we need to look at what it should be used for. Let us assume the rather uncomfortable thought that the NYC government wants to predict who gets an education in order to know which people to spend resources on. Their goal is then probably to invest only in the people who get a higher education, and to avoid the risk of investing in someone who does not. Here they would prefer a higher precision over a higher recall, because this indicates that they seldom invest in the wrong people; they invest a little less, although they will miss some potentially good investments. Hence in this case our model would actually be useful.

The above thought is morally uncomfortable, but it is nonetheless the sad truth that, based on the data on the NYC situation in 2015, we can see the segregation in the model and could use people’s neutral features to predict education.

Talking about segregation, we can look into our model to see whether it is biased, and in which areas it is most biased. In our case, we know that the model is biased, since we only included features that should not have an influence. But we can take a look at which features have the most influence, and hence where the model is most biased. Therefore we have plotted the importance of the different features, which is simply a measure of how important each feature is for the prediction.

_images/Report2_6_0.png

We see that ethnicity_1 (“Non-Hispanic White”) is the most important feature, Boro_3 (Manhattan) is the second most important, and ethnicity_4 (“Hispanic, Any Race”) is the third most important. Hence the model sees a person’s race as an important attribute for predicting the education level attained. This does not seem fair.
Luckily, sex does not seem that important, which is exactly what we saw in the distribution of sexes across education levels above.

Next, we have plotted the normalized confusion matrix of the model’s overall performance. It shows the ratio between the actual educational status and the predicted educational status. These are the numbers used to calculate the statistics with which we evaluated the model above. From the matrix we see that our model is better at predicting the people who do get an education than the people who don’t. Hence the model leans towards predicting that people will get a higher education.

Finally, we go back and check whether our concern about the model being very biased with respect to race holds true. To investigate this we have plotted the difference in the model’s performance for three ethnicities: Caucasian, African-American, and Hispanic.


On the y-axis we have the labels TN (true negative: correctly predicted no higher education), FP (false positive: predicted higher education, but had no higher education), FN (false negative: predicted no higher education, but had higher education), and TP (true positive: correctly predicted higher education). The x-axis shows the difference between the chosen race and the overall model, which is not split by race.
From the plot, we see that for Caucasians the model has far fewer true negatives and false negatives, and many more false positives and true positives. This means the model very seldom predicts a white person not to have an education, and most frequently predicts Caucasians to have a higher education. For Hispanic and African-American people we see the opposite trend, meaning the model tends towards predicting that black and Hispanic people don’t have an education. Hence we were right in our concern about the bias in the model. This bias is simply due to the data being biased, indicating racial segregation in the city of New York. This is an unfortunate fact we cannot change.

But if we were to use the model to predict education and did not want it to be unfair to some races, we can debias the model with respect to race. We have chosen Caucasian and Hispanic, since these are the most influential, and included African-American as well, since there is a lot of history regarding black people’s educational rights (and of course rights in general). The idea behind debiasing is to make the model equally fair for all races. We do this by getting a similar true positive rate \((\frac{TP}{TP+FN})\) and false positive rate \((\frac{FP}{FP+TN})\) for the races. Ideally, we want a high true positive rate and a low false positive rate.

For our current overall model we have a TPR (true positive rate) of 0.79 and an FPR (false positive rate) of 0.54. But for Caucasians both of these measurements are higher, and for Hispanics both are much lower. We cannot change the model, but we can change how the model predicts. The model gives each person a probability of having a higher education. Normally, if the probability is above 0.5 (50%), it predicts the person to have a higher education; if it is below, it predicts the person not to have one. The value 0.5 is called the threshold. To debias the model we can change the threshold depending on which race we are looking at. We will find which thresholds give us the best and most similar TPRs and FPRs for the races.
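The effect of a group-specific threshold can be sketched as follows. The group names, threshold values and probabilities are placeholders for illustration, not the values chosen in this report. The same predicted probability leads to different predictions depending on the group’s threshold.

```python
DEFAULT_THRESHOLD = 0.5


def predict(probability, group, thresholds, default=DEFAULT_THRESHOLD):
    """Predict higher education (1) if the probability exceeds the group's threshold."""
    return int(probability > thresholds.get(group, default))


# Hypothetical thresholds: group_a needs a higher probability than group_b.
thresholds = {"group_a": 0.6, "group_b": 0.4}

# Three people with the same predicted probability but different groups;
# group_c has no specific threshold and falls back to the default of 0.5.
people = [("group_a", 0.55), ("group_b", 0.55), ("group_c", 0.55)]
predictions = [predict(p, group, thresholds) for group, p in people]
print(predictions)  # [0, 1, 1]
```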

To do this we’ve calculated the TPRs and FPRs for each race at a number of different thresholds. We then plot the FPR on the x-axis and the TPR on the y-axis, where each point corresponds to a threshold. This is called a ROC curve.
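The points of such a curve can be computed by sweeping the threshold and recording the FPR and TPR at each value. The labels and probabilities below are made up for a single group; in the analysis above this would be repeated per race.

```python
def rates(labels, probabilities, threshold):
    """Return (FPR, TPR) when predicting 1 for probabilities above the threshold."""
    tp = fp = fn = tn = 0
    for actual, probability in zip(labels, probabilities):
        predicted = probability > threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr


def roc_points(labels, probabilities, thresholds):
    """One (threshold, FPR, TPR) tuple per threshold."""
    return [(t, *rates(labels, probabilities, t)) for t in thresholds]


# Made-up data: 1 = higher education, 0 = no higher education.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

points = roc_points(labels, probabilities, [0.25, 0.5, 0.75])
print(points)  # [(0.25, 0.75, 1.0), (0.5, 0.25, 1.0), (0.75, 0.0, 0.5)]
```

Raising the threshold moves a point towards the lower left of the plot: fewer false positives, but also fewer true positives.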


Evaluating the ROC curve visually, we want to find three thresholds (points), one per race, that have a high TPR and a low FPR and lie close to each other, making the rates for the races similar. We have selected the following three thresholds:

African-American: 0.53
Caucasian: 0.68
Hispanic: 0.42


This means that for African-Americans the probability needs to be above 0.53 to predict higher education, whereas for Caucasians it needs to be above 0.68, and for Hispanics it only needs to be above 0.42.

The specific threshold values that debias our model reveal how segregated the population is. We see how much we need to shift the thresholds relative to each other in order to achieve a fair model. Specifically, the threshold for Hispanics is almost two thirds of that for Caucasians.

We can now take a look at the effect of our debiasing. To do this we have plotted the TPR and FPR for each race, before and after the debiasing, to see how we have minimized the difference between the races.

_images/Report2_10_0.png

Hence, after debiasing, our chosen thresholds result in the following rates:

                    TPR     FPR
African-American    0.59    0.45
Caucasian           0.65    0.34
Hispanic            0.66    0.49

So now we have a reasonably fair model with respect to ethnicity (the rates are roughly similar for the three races), and it performs rather okay. The TPRs are clearly higher than the FPRs, so the model is more likely to predict higher education for people who have it than for people who do not.